By Felipe Hoffa (@felipehoffa, /r/bigquery)
In [13]:
# Some pre-flight work:
# - Libraries.
# - Patch to handle authentication automatically on Google Compute Engine.
from oauth2client.gce import AppAssertionCredentials
from bigquery_client import BigqueryClient
from pandas.io import gbq
import pandas as pd

def GetMetadata(path):
    # Read a value from the Compute Engine metadata server.
    import urllib2
    BASE_PATH = 'http://metadata/computeMetadata/v1/'
    request = urllib2.Request(BASE_PATH + path, headers={'Metadata-Flavor': 'Google'})
    return urllib2.urlopen(request).read()

credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')
client = BigqueryClient(credentials=credentials,
                        api='https://www.googleapis.com',
                        api_version='v2',
                        project_id=GetMetadata('project/project-id'))

# Monkey-patch pandas' gbq module so read_gbq() uses this client.
gbq._authenticate = lambda: client
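With gbq._authenticate patched to return this client, the gbq.read_gbq() calls below pick up the Compute Engine credentials transparently. As a quick smoke test (a sketch, not part of the original notebook), we could run a trivial query through the same helper:
# Hypothetical smoke test: confirm the patched gbq helper can reach the view used below.
gbq.read_gbq("SELECT country FROM [gdelt-bq:sample_views.country_date_matconf_numarts] LIMIT 1")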
How many rows does GDELT have for each country?
We are going to use a view defined specifically for this: 'sample_views.country_date_matconf_numarts' counts the number of material conflicts per day per country.
To see the view definition, go to https://bigquery.cloud.google.com/table/gdelt-bq:sample_views.country_date_matconf_numarts and click on 'Details'.
In [2]:
query = """
SELECT country, SUM(c) AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
GROUP BY country
ORDER BY count DESC
LIMIT 10
"""
gbq.read_gbq(query)
Out[2]:
Focusing solely on Egypt, let's plot the number of material conflicts reported each day over the last 35 years:
In [9]:
query = """
SELECT date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
WHERE country='Egypt'
ORDER BY date
"""
data = gbq.read_gbq(query)
# The view returns dates as microseconds since the epoch; multiply by 1000
# to get the nanoseconds that pd.to_datetime() expects.
data.index = pd.to_datetime(data['date'] * 1000)
x = data['count'].plot()
What if we focus only on the 30 days preceding January 27th, 2011? The query below cross joins the view with the integers 0 through 29 and keeps the rows where date plus i days equals January 27th, 2011 (86400000000 is the number of microseconds in one day).
In [28]:
query = """
SELECT country, date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Egypt'
AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
"""
data_egypt = gbq.read_gbq(query)
data_egypt.index = pd.to_datetime(data_egypt['date'] * 1000)
x = data_egypt['count'].plot()
With that timeline defined, we can look for similar timelines across all years and countries.
To make this work, we use the trick explained at http://stackoverflow.com/questions/24923101/computing-a-moving-maximum-in-bigquery/24943950#24943950.
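The idea behind that trick: cross joining each (country, date, count) row with the integers 0 through 29 makes 30 shifted copies of every row, so all rows that share the same shifted date (ending_at) form a 30-day window ending on that date. Here is a minimal local illustration of the same replication in pandas (a sketch with made-up numbers, 3 offsets instead of 30, assuming a reasonably recent pandas):
# Illustration only (not run against BigQuery): build trailing windows by
# cross joining daily rows with a small table of day offsets.
days = pd.DataFrame({'date': pd.date_range('2011-01-01', periods=5),
                     'c': [1, 2, 3, 4, 5]})
offsets = pd.DataFrame({'i': [0, 1, 2]})
expanded = days.assign(key=1).merge(offsets.assign(key=1), on='key').drop('key', axis=1)
expanded['ending_at'] = expanded['date'] + pd.to_timedelta(expanded['i'], unit='d')
# Grouping by ending_at now collects the counts of the trailing 3 days.
print(expanded.groupby('ending_at')['c'].sum())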
In [12]:
query = """
SELECT
STRFTIME_UTC_USEC(a.ending_at, "%Y-%m-%d") ending_at1,
STRFTIME_UTC_USEC(b.ending_at-30*86400000000, "%Y-%m-%d") starting_at2,
STRFTIME_UTC_USEC(b.ending_at, "%Y-%m-%d") ending_at2,
a.country, b.country, CORR(a.c, b.c) corr, COUNT(*) c
FROM (
SELECT country, date+i*86400000000 ending_at, c, i
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
) b
JOIN (
SELECT country, date+i*86400000000 ending_at, c, i
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Egypt'
AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
) a
ON a.i=b.i
WHERE a.ending_at != b.ending_at
GROUP EACH BY ending_at1, ending_at2, starting_at2, a.country, b.country
HAVING (c = 30 AND ABS(corr) > 0.254)
ORDER BY corr DESC
LIMIT 10
"""
data = gbq.read_gbq(query)
data
Out[12]:
Those are the 10 periods whose day-by-day counts correlate most strongly with the timeline of material conflicts in Egypt during the 30 days before January 27th, 2011.
Let's plot Chad during the 30 days leading up to November 18th, 2013 (i.e., starting October 19th).
In [34]:
query = """
SELECT country, date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Chad'
AND date+i*86400000000 = PARSE_UTC_USEC('2013-11-18')
"""
data_chad = gbq.read_gbq(query)
data_chad.index = pd.to_datetime(data_chad['date'] * 1000)
data_chad['count'].plot()
Out[34]:
In fact, both periods look remarkably similar (though not necessarily significant; more analysis would be needed before reaching any conclusions):
In [33]:
data_egypt['count'].plot()
Out[33]:
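As a quick local sanity check (a sketch, not part of the original analysis, assuming both DataFrames above contain all 30 daily counts), the same Pearson correlation can be recomputed with numpy, aligning the two windows by date:
import numpy as np
# Align each 30-day window day by day (oldest to newest) and correlate locally.
egypt = data_egypt['count'].sort_index().values
chad = data_chad['count'].sort_index().values
print(np.corrcoef(egypt, chad)[0, 1])  # should roughly match BigQuery's CORR() above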
To look further into how to read these results to understand history, look at: